1 Reading the Data

1.1 The readr approach

# readr produces a tibble
# which then shows in the printout the number of rows, cols and types of the cols
# we gained some valuable information when we did that
bike = readr::read_csv('https://raw.githubusercontent.com/fmegahed/isa401/main/data/bike_sharing_data.csv')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 17379 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): datetime, sources
## dbl (11): season, holiday, workingday, weather, temp, atemp, humidity, winds...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.2 The base R approach

bike_base = read.csv('https://raw.githubusercontent.com/fmegahed/isa401/main/data/bike_sharing_data.csv')
dplyr::glimpse(bike_base) # if you were to use the base r read.csv; you will also need to fix humidity
## Rows: 17,379
## Columns: 13
## $ datetime   <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/201…
## $ season     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ workingday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather    <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp       <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13.12…
## $ atemp      <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, 12.…
## $ humidity   <chr> "81", "80", "80", "75", "75", "75", "80", "86", "75", "76",…
## $ windspeed  <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, 0.0…
## $ casual     <int> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <int> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ count      <int> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
## $ sources    <chr> "ad campaign", "www.yahoo.com", "www.google.fi", "AD campai…

2 Toward a Technically Correct Dataset

2.1 Approach 1

2.1.1 dplyr::glimpse() or str()

dplyr::glimpse(bike) # type of each variable, first few observations
## Rows: 17,379
## Columns: 13
## $ datetime   <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/201…
## $ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather    <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp       <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13.12…
## $ atemp      <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, 12.…
## $ humidity   <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, 72,…
## $ windspeed  <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, 0.0…
## $ casual     <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ count      <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
## $ sources    <chr> "ad campaign", "www.yahoo.com", "www.google.fi", "AD campai…

2.1.2 Are the column names reasonable?

Yes, therefore we will not change the column names.

2.1.3 Are the types/classes for each column reasonable?

The classes for some of the columns will need to change. So let us change them.

# datetime: it is currently a chr vector and we will change it accordingly
# this will convert the column into a datetime column and we knew the appropriate
# function from lubridate by seeing the values in that column and going down the rows
# until we saw that the second number had 13 and 14 in it which means that it cannot
# be a month
bike$datetime = lubridate::mdy_hm(bike$datetime)

# we will now change the columns season to weather to factor
# we could have done this in a way similar to line 53, but we will try to be
# "lazy" and do it in one/two steps (instead of four)
bike = bike |> 
  # converted the columns season:weather into chr
  dplyr::mutate(dplyr::across(.cols = season:weather, .fn = as.character) ) |>
  # converted them into a factor (could have also used forcats::as_factor instead)
  dplyr::mutate(dplyr::across(.cols = season:weather, .fn = as.factor) )

dplyr::glimpse(bike)
## Rows: 17,379
## Columns: 13
## $ datetime   <dttm> 2011-01-01 00:00:00, 2011-01-01 01:00:00, 2011-01-01 02:00…
## $ season     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ workingday <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weather    <fct> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp       <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13.12…
## $ atemp      <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, 12.…
## $ humidity   <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, 72,…
## $ windspeed  <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, 0.0…
## $ casual     <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ count      <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
## $ sources    <chr> "ad campaign", "www.yahoo.com", "www.google.fi", "AD campai…

Our dataset now has reasonable column names (we did not change that) and correct column types. Therefore, our data is now technically correct.

2.2 Approach 2

2.2.1 The pointblank package

library(pointblank) # we will load it since we will be using a bunch of funtions from it
# this will save some time pointblank::fun_name

# rereading the data since we already fixed bike 
bike_tbl = readr::read_csv('https://raw.githubusercontent.com/fmegahed/isa401/main/data/bike_sharing_data.csv')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 17379 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): datetime, sources
## dbl (11): season, holiday, workingday, weather, temp, atemp, humidity, winds...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# decimal means that it is the pct of rows, while a number means number of rows
act = action_levels(warn_at = 0.01, notify_at = 0.01, stop_at = 0.01)
act # we do not need to set them all the same (it is done here for the example)
## -- The `action_levels` settings
## WARN failure threshold of 0.01 of all test units.
## STOP failure threshold of 0.01 of all test units.
## NOTIFY failure threshold of 0.01 of all test units.
## ----
agent = 
  # create agent starts the table of checks
  create_agent(tbl = bike_tbl, actions = act) |>
  # we will populate the rows without doing the checks (checks are done in the interrogation step)
  col_is_date(columns = datetime) |> 
  # given that the columns were consecutive I can use col_start:col_end
  # otherwise you need to use the function, see col_if_... for details
  col_is_factor(columns = season:weather) |> 
  # we do not really care if they are dbl vs int but I am showing you that
  # if you did care, we can check for that
  col_is_numeric(columns = temp:windspeed) |> 
  col_is_integer(columns = casual:count) |> 
  col_is_character(columns = vars(sources))

res = interrogate(agent = agent, sample_limit = 5000)
res
Pointblank Validation
[2023-10-03|09:52:31]

tibble bike_tblWARN 0.01 STOP 0.01 NOTIFY 0.01
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_is_date
 col_is_date()

datetime

1 0
0.00
1
1.00

2
col_is_factor
 col_is_factor()

season

1 0
0.00
1
1.00

3
col_is_factor
 col_is_factor()

holiday

1 0
0.00
1
1.00

4
col_is_factor
 col_is_factor()

workingday

1 0
0.00
1
1.00

5
col_is_factor
 col_is_factor()

weather

1 0
0.00
1
1.00

6
col_is_numeric
 col_is_numeric()

temp

1 1
1.00
0
0.00

7
col_is_numeric
 col_is_numeric()

atemp

1 1
1.00
0
0.00

8
col_is_numeric
 col_is_numeric()

humidity

1 1
1.00
0
0.00

9
col_is_numeric
 col_is_numeric()

windspeed

1 1
1.00
0
0.00

10
col_is_integer
 col_is_integer()

casual

1 0
0.00
1
1.00

11
col_is_integer
 col_is_integer()

registered

1 0
0.00
1
1.00

12
col_is_integer
 col_is_integer()

count

1 0
0.00
1
1.00

13
col_is_character
 col_is_character()

sources

1 1
1.00
0
0.00

2023-10-03 09:52:31 EDT < 1 s 2023-10-03 09:52:31 EDT

2.2.2 Fix columns based on the warnings

Similar to approach 1 in the manual_fixing code chunk, so we will not show it in class.

LS0tDQp0aXRsZTogIlRlY2huaWNhbGx5IENvcnJlY3QgYW5kIENvbnNpc3RlbnQgRGF0YSINCmF1dGhvcjogIkZhZGVsIE1lZ2FoZWQiDQpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiDQpvdXRwdXQ6IA0KICBodG1sX2RvY3VtZW50Og0KICAgIGNvZGVfZG93bmxvYWQ6IFRSVUUNCiAgICBjb2RlX2ZvbGRpbmc6IHNob3cNCiAgICB0b2M6IFRSVUUNCiAgICB0b2NfZmxvYXQ6IFRSVUUNCiAgICBudW1iZXJfc2VjdGlvbnM6IFRSVUUNCi0tLQ0KDQpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkNCmBgYA0KDQojIFJlYWRpbmcgdGhlIERhdGENCg0KIyMgVGhlIGByZWFkcmAgYXBwcm9hY2gNCg0KYGBge3IgcmVhZF9kYXRhfQ0KIyByZWFkciBwcm9kdWNlcyBhIHRpYmJsZQ0KIyB3aGljaCB0aGVuIHNob3dzIGluIHRoZSBwcmludG91dCB0aGUgbnVtYmVyIG9mIHJvd3MsIGNvbHMgYW5kIHR5cGVzIG9mIHRoZSBjb2xzDQojIHdlIGdhaW5lZCBzb21lIHZhbHVhYmxlIGluZm9ybWF0aW9uIHdoZW4gd2UgZGlkIHRoYXQNCmJpa2UgPSByZWFkcjo6cmVhZF9jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9mbWVnYWhlZC9pc2E0MDEvbWFpbi9kYXRhL2Jpa2Vfc2hhcmluZ19kYXRhLmNzdicpDQpgYGANCg0KIyMgVGhlIGBiYXNlIFJgIGFwcHJvYWNoDQpgYGB7ciByZWFkX2RhdGFfYmFzZX0NCmJpa2VfYmFzZSA9IHJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZm1lZ2FoZWQvaXNhNDAxL21haW4vZGF0YS9iaWtlX3NoYXJpbmdfZGF0YS5jc3YnKQ0KZHBseXI6OmdsaW1wc2UoYmlrZV9iYXNlKSAjIGlmIHlvdSB3ZXJlIHRvIHVzZSB0aGUgYmFzZSByIHJlYWQuY3N2OyB5b3Ugd2lsbCBhbHNvIG5lZWQgdG8gZml4IGh1bWlkaXR5DQpgYGANCg0KDQojIFRvd2FyZCBhIFRlY2huaWNhbGx5IENvcnJlY3QgRGF0YXNldA0KDQojIyBBcHByb2FjaCAxDQoNCiMjIyBgZHBseXI6OmdsaW1wc2UoKWAgb3IgYHN0cigpYA0KDQpgYGB7ciBxdWlja19jaGVja3N9DQpkcGx5cjo6Z2xpbXBzZShiaWtlKSAjIHR5cGUgb2YgZWFjaCB2YXJpYWJsZSwgZmlyc3QgZmV3IG9ic2VydmF0aW9ucw0KYGBgDQoNCiMjIyBBcmUgdGhlIGNvbHVtbiBuYW1lcyByZWFzb25hYmxlPw0KDQpZZXMsIHRoZXJlZm9yZSB3ZSB3aWxsIG5vdCBjaGFuZ2UgdGhlIGNvbHVtbiBuYW1lcy4NCg0KIyMjIEFyZSB0aGUgdHlwZXMvY2xhc3NlcyBmb3IgZWFjaCBjb2x1bW4gcmVhc29uYWJsZT8NCg0KVGhlIGNsYXNzZXMgZm9yICoqc29tZSoqIG9mIHRoZSBjb2x1bW5zIHdpbGwgbmVlZCB0byBjaGFuZ2UuIFNvIGxldCB1cyBjaGFuZ2UgdGhlbS4NCg0KYGBge3IgbWFudWFsX2ZpeGluZ30NCiMgZGF0ZXRpbWU6IGl0IGlzIGN1cnJlbnRseSBhIGNociB2ZWN0b3IgYW5kIHdlIHdpbGwgY2hhbmdlIGl0IGFjY29yZGluZ2x5DQojIHRoaXMgd2lsbCBjb252ZXJ0IHRoZSBjb2x1bW4gaW50byBhIGRhdGV0aW1lIGNvbHVtbiBhbmQgd2Uga25ldyB0aGUgYXBwcm9wcmlhdGUNCiMgZnVuY3Rpb24gZnJvbSBsdWJyaWRhdGUgYnkgc2VlaW5nIHRoZSB2YWx1ZXMgaW4gdGhhdCBjb2x1bW4gYW5kIGdvaW5nIGRvd24gdGhlIHJvd3MNCiMgdW50aWwgd2Ugc2F3IHRoYXQgdGhlIHNlY29uZCBudW1iZXIgaGFkIDEzIGFuZCAxNCBpbiBpdCB3aGljaCBtZWFucyB0aGF0IGl0IGNhbm5vdA0KIyBiZSBhIG1vbnRoDQpiaWtlJGRhdGV0aW1lID0gbHVicmlkYXRlOjptZHlfaG0oYmlrZSRkYXRldGltZSkNCg0KIyB3ZSB3aWxsIG5vdyBjaGFuZ2UgdGhlIGNvbHVtbnMgc2Vhc29uIHRvIHdlYXRoZXIgdG8gZmFjdG9yDQojIHdlIGNvdWxkIGhhdmUgZG9uZSB0aGlzIGluIGEgd2F5IHNpbWlsYXIgdG8gbGluZSA1MywgYnV0IHdlIHdpbGwgdHJ5IHRvIGJlDQojICJsYXp5IiBhbmQgZG8gaXQgaW4gb25lL3R3byBzdGVwcyAoaW5zdGVhZCBvZiBmb3VyKQ0KYmlrZSA9IGJpa2UgfD4gDQogICMgY29udmVydGVkIHRoZSBjb2x1bW5zIHNlYXNvbjp3ZWF0aGVyIGludG8gY2hyDQogIGRwbHlyOjptdXRhdGUoZHBseXI6OmFjcm9zcyguY29scyA9IHNlYXNvbjp3ZWF0aGVyLCAuZm4gPSBhcy5jaGFyYWN0ZXIpICkgfD4NCiAgIyBjb252ZXJ0ZWQgdGhlbSBpbnRvIGEgZmFjdG9yIChjb3VsZCBoYXZlIGFsc28gdXNlZCBmb3JjYXRzOjphc19mYWN0b3IgaW5zdGVhZCkNCiAgZHBseXI6Om11dGF0ZShkcGx5cjo6YWNyb3NzKC5jb2xzID0gc2Vhc29uOndlYXRoZXIsIC5mbiA9IGFzLmZhY3RvcikgKQ0KDQpkcGx5cjo6Z2xpbXBzZShiaWtlKQ0KYGBgDQoNCk91ciBkYXRhc2V0IG5vdyBoYXMgcmVhc29uYWJsZSBjb2x1bW4gbmFtZXMgKHdlIGRpZCBub3QgY2hhbmdlIHRoYXQpIGFuZCBjb3JyZWN0IGNvbHVtbiB0eXBlcy4gVGhlcmVmb3JlLCBvdXIgZGF0YSBpcyAqKm5vdyB0ZWNobmljYWxseSBjb3JyZWN0KiouIA0KDQoNCg0KDQojIyBBcHByb2FjaCAyDQoNCiMjIyBUaGUgYHBvaW50YmxhbmtgIHBhY2thZ2UNCg0KYGBge3IgcG9pbnRfYmxhbmtfY2hlY2tzfQ0KbGlicmFyeShwb2ludGJsYW5rKSAjIHdlIHdpbGwgbG9hZCBpdCBzaW5jZSB3ZSB3aWxsIGJlIHVzaW5nIGEgYnVuY2ggb2YgZnVudGlvbnMgZnJvbSBpdA0KIyB0aGlzIHdpbGwgc2F2ZSBzb21lIHRpbWUgcG9pbnRibGFuazo6ZnVuX25hbWUNCg0KIyByZXJlYWRpbmcgdGhlIGRhdGEgc2luY2Ugd2UgYWxyZWFkeSBmaXhlZCBiaWtlIA0KYmlrZV90YmwgPSByZWFkcjo6cmVhZF9jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9mbWVnYWhlZC9pc2E0MDEvbWFpbi9kYXRhL2Jpa2Vfc2hhcmluZ19kYXRhLmNzdicpDQoNCiMgZGVjaW1hbCBtZWFucyB0aGF0IGl0IGlzIHRoZSBwY3Qgb2Ygcm93cywgd2hpbGUgYSBudW1iZXIgbWVhbnMgbnVtYmVyIG9mIHJvd3MNCmFjdCA9IGFjdGlvbl9sZXZlbHMod2Fybl9hdCA9IDAuMDEsIG5vdGlmeV9hdCA9IDAuMDEsIHN0b3BfYXQgPSAwLjAxKQ0KYWN0ICMgd2UgZG8gbm90IG5lZWQgdG8gc2V0IHRoZW0gYWxsIHRoZSBzYW1lIChpdCBpcyBkb25lIGhlcmUgZm9yIHRoZSBleGFtcGxlKQ0KDQphZ2VudCA9IA0KICAjIGNyZWF0ZSBhZ2VudCBzdGFydHMgdGhlIHRhYmxlIG9mIGNoZWNrcw0KICBjcmVhdGVfYWdlbnQodGJsID0gYmlrZV90YmwsIGFjdGlvbnMgPSBhY3QpIHw+DQogICMgd2Ugd2lsbCBwb3B1bGF0ZSB0aGUgcm93cyB3aXRob3V0IGRvaW5nIHRoZSBjaGVja3MgKGNoZWNrcyBhcmUgZG9uZSBpbiB0aGUgaW50ZXJyb2dhdGlvbiBzdGVwKQ0KICBjb2xfaXNfZGF0ZShjb2x1bW5zID0gZGF0ZXRpbWUpIHw+IA0KICAjIGdpdmVuIHRoYXQgdGhlIGNvbHVtbnMgd2VyZSBjb25zZWN1dGl2ZSBJIGNhbiB1c2UgY29sX3N0YXJ0OmNvbF9lbmQNCiAgIyBvdGhlcndpc2UgeW91IG5lZWQgdG8gdXNlIHRoZSBmdW5jdGlvbiwgc2VlIGNvbF9pZl8uLi4gZm9yIGRldGFpbHMNCiAgY29sX2lzX2ZhY3Rvcihjb2x1bW5zID0gc2Vhc29uOndlYXRoZXIpIHw+IA0KICAjIHdlIGRvIG5vdCByZWFsbHkgY2FyZSBpZiB0aGV5IGFyZSBkYmwgdnMgaW50IGJ1dCBJIGFtIHNob3dpbmcgeW91IHRoYXQNCiAgIyBpZiB5b3UgZGlkIGNhcmUsIHdlIGNhbiBjaGVjayBmb3IgdGhhdA0KICBjb2xfaXNfbnVtZXJpYyhjb2x1bW5zID0gdGVtcDp3aW5kc3BlZWQpIHw+IA0KICBjb2xfaXNfaW50ZWdlcihjb2x1bW5zID0gY2FzdWFsOmNvdW50KSB8PiANCiAgY29sX2lzX2NoYXJhY3Rlcihjb2x1bW5zID0gdmFycyhzb3VyY2VzKSkNCg0KcmVzID0gaW50ZXJyb2dhdGUoYWdlbnQgPSBhZ2VudCwgc2FtcGxlX2xpbWl0ID0gNTAwMCkNCnJlcw0KYGBgDQoNCg0KIyMjIEZpeCBjb2x1bW5zIGJhc2VkIG9uIHRoZSB3YXJuaW5ncw0KU2ltaWxhciB0byBhcHByb2FjaCAxIGluIHRoZSBgbWFudWFsX2ZpeGluZ2AgY29kZSBjaHVuaywgc28gd2Ugd2lsbCBub3Qgc2hvdyBpdCBpbiBjbGFzcy4NCg0KDQo=